Remove remaining uses of FFI under -fpure-haskell #660
Conversation
All of these were standard C functions that GHC's JS backend actually somewhat supports; their shims can be found in the compiler source at "rts/js/mem.js". But it seems simpler to just get rid of all FFI uses with -fpure-haskell rather than try to keep track of which functions GHC supports.

The pure Haskell implementation of memcmp runs about 6-7x as fast as the simple one-byte-at-a-time implementation for long equal buffers, which makes it... about the same speed as the pre-existing shim, even though the latter is also a one-byte-at-a-time implementation!

Apparently GHC's JS backend is not yet able to produce efficient code for tight loops like these; the biggest problem is that it does not perform any loopification, so each iteration must go through a generic-call indirection. Unfortunately that means that this patch probably makes 'strlen' and 'memchr' much slower with the JS backend.
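To make the "generic-call indirection per iteration" point concrete, here is a rough sketch in plain JavaScript of the two compilation strategies for a self-recursive byte comparison. This is hypothetical illustration code, not actual GHC JS backend output; the `genericCall` helper is an invented stand-in for the backend's generic dispatch machinery.

```javascript
// Naive strategy: every "iteration" of the recursion goes through a
// generic dispatch helper, standing in for the generic-call
// indirection described above. (In a real RTS this helper would
// check arity, build frames, etc.)
function genericCall(fn, args) {
  return fn.apply(null, args);
}

function memcmpGeneric(a, b, i, n) {
  if (i >= n) return 0;
  const d = a[i] - b[i];
  if (d !== 0) return d;
  return genericCall(memcmpGeneric, [a, b, i + 1, n]); // one call per byte
}

// Loopified strategy: the self tail call is compiled to a while loop,
// with no per-iteration call or allocation overhead at all.
function memcmpLoop(a, b, i, n) {
  while (i < n) {
    const d = a[i] - b[i];
    if (d !== 0) return d;
    i += 1;
  }
  return 0;
}
```

Both follow the C memcmp convention (negative/zero/positive result); the difference is purely in how the loop itself is driven, which is where the per-iteration overhead discussed here comes from.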
cc @hsyl20
@luite told me it would be a big performance hit for JS, or that we would need to figure out how to optimize recursive functions like these first. As an alternative we could perhaps define GHC primops for common libc operations? That would make
Or perhaps something provided by
Alternatively, if GHC performed loopification via join points: https://gitlab.haskell.org/ghc/ghc/-/issues/14068, then we would get it for free.
We definitely should have a primop for

There actually is a version of
It would be relatively trivial to write an
It looks like the tail calls of join points get turned into trampolines, which is better than the status quo for non-join-pointed tail calls, but still not as good as we'd like for basic self-loops like these, which should be very efficiently implementable by wrapping them in

Also, we should be able to just trampoline for known exactly-saturated function calls in tail position instead of producing generic-call code, even if we are not calling a join point. Right?

In any case, my inclination is to just accept these performance regressions for now.
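As a sketch of the trade-off being described (helper names are mine, and this is an illustration of the general technique, not GHC's actual code generation): a trampolined self tail call bounds stack usage but still pays one closure allocation and one call per iteration, whereas a direct while loop pays nothing per iteration.

```javascript
// Trampoline driver: keeps forcing thunks until a non-function
// value comes back.
function trampoline(step) {
  let r = step;
  while (typeof r === "function") r = r();
  return r;
}

// Trampolined strlen over a zero-terminated byte buffer: instead of
// calling itself in tail position, each step returns a thunk.
function strlenStep(buf, i) {
  if (buf[i] === 0) return i;
  return () => strlenStep(buf, i + 1); // one allocation per iteration
}

function strlenTrampoline(buf) {
  return trampoline(strlenStep(buf, 0));
}

// Direct loop: what compiling the self-loop straight to a while
// loop would give us, with no per-iteration allocation at all.
function strlenLoop(buf) {
  let i = 0;
  while (buf[i] !== 0) i += 1;
  return i;
}
```

Both compute the same result; the point is that the trampoline is a general mechanism for arbitrary tail calls, while a known self-loop admits the much cheaper direct-loop form.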
opened: https://gitlab.haskell.org/ghc/ghc/-/issues/24442
sounds good to me.
(cherry picked from commit 305604c)
(I noticed this situation while working on #569.)
(This is based on top of #659 to avoid pointless CPP.)